Assignment 03

Author: Gavin Boss
Affiliation: Boston University
Published: September 20, 2025
Modified: September 24, 2025


1 Companies Table

   company_name                                       company_raw                                        company_is_staffing  company_id
0  Crowe                                              Crowe                                              False                0
1  The Devereux Foundation                            The Devereux Foundation                            False                1
2  Elder Research                                     Elder Research                                     False                2
3  NTT DATA                                           NTT DATA Inc                                       False                3
4  Frederick National Laboratory For Cancer Research  Frederick National Laboratory for Cancer Research  False                4

2 Data Preparation (Clean Up Data)

Medians : 87295.0 130042.0 115024.0
Data cleaning complete. Rows retained: 72498
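
The cleaning code itself does not appear in this output. As a minimal plain-Python sketch, assuming the medians printed above are used to fill in missing salary values (an assumption, as are the hypothetical field names and toy records below):

```python
from statistics import median

def fill_with_median(rows, field):
    """Return (new_rows, median) with missing `field` values replaced by the median."""
    present = [r[field] for r in rows if r.get(field) is not None]
    med = median(present)
    filled = [{**r, field: med if r.get(field) is None else r[field]} for r in rows]
    return filled, med

# Hypothetical toy records, not the assignment's data.
postings = [
    {"title": "Analyst", "salary": 80000.0},
    {"title": "Engineer", "salary": None},
    {"title": "Scientist", "salary": 120000.0},
]
cleaned, salary_median = fill_with_median(postings, "salary")
```

The function builds new records rather than editing them in place, so the raw rows stay available for comparison.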

3 Salary Distribution by Industry and Employment Type

   EMPLOYMENT_TYPE_NAME     SALARY
0  Part-time / full-time    92500.0
1  Full-time (> 32 hours)   110155.0
2  Full-time (> 32 hours)   92962.0
3  Full-time (> 32 hours)   107645.5
4  Full-time (> 32 hours)   192800.0

4 Salary Analysis by ONET Occupation Type (Bubble Chart)

Appendix 1: I asked Copilot for help because my aggregation was not working correctly. The problem turned out to be the combination of the aggregation and the sorting we had done in the Saturday help session. AI prompts attached.
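
One way to keep aggregation and sorting from interfering is to aggregate first and sort only the summarized rows. The sketch below is plain Python rather than the Spark code used in the assignment. It assumes the bubble chart needs a median salary and a posting count per occupation, and the function name and toy data are hypothetical:

```python
from collections import defaultdict
from statistics import median

def summarize_by_occupation(rows):
    """Aggregate first (median salary, posting count), then sort the summary."""
    salaries = defaultdict(list)
    for occupation, salary in rows:
        salaries[occupation].append(salary)
    summary = [
        {"occupation": occ, "median_salary": median(vals), "postings": len(vals)}
        for occ, vals in salaries.items()
    ]
    # Sort the aggregated rows, not the raw ones, so the order reflects the summary.
    return sorted(summary, key=lambda r: r["postings"], reverse=True)

# Hypothetical toy data, not the assignment's results.
rows = [
    ("Data Scientists", 120000.0),
    ("Business Analysts", 90000.0),
    ("Data Scientists", 130000.0),
    ("Data Scientists", 110000.0),
    ("Business Analysts", 100000.0),
]
bubbles = summarize_by_occupation(rows)
```

Each summary row then maps directly onto one bubble: median salary for position and posting count for size.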


5 Salary by Education Level (Two Groups)

Create two groups:
    Associate’s or lower (GED, Associate, No Education Listed)
    Bachelor’s (Bachelor’s degree)
    Master’s (Master’s degree)
    PhD (PhD, Doctorate, professional degree)
Plot a scatter plot for each group using MAX_YEARS_EXPERIENCE (with jitter), Average_Salary, and LOT_V6_SPECIALIZED_OCCUPATION_NAME.
After each graph, add a short explanation of key insights.

6 Salary by Education Level (Four Groups)

Create four groups:
    Associate’s or lower (GED, Associate, No Education Listed)
    Bachelor’s (Bachelor’s degree)
    Master’s (Master’s degree)
    PhD (PhD, Doctorate, professional degree)
Plot a scatter plot for each group using MAX_YEARS_EXPERIENCE (with jitter), Average_Salary, and LOT_V6_SPECIALIZED_OCCUPATION_NAME.
After each graph, add a short explanation of key insights.
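
The four groups listed above can be sketched as a small mapping function. This is an illustrative plain-Python version, not the assignment's Spark code. The keyword matching and the function name are assumptions, and the raw labels in the actual data may be worded differently:

```python
def education_group(level):
    """Map a raw education label to one of the four groups listed above."""
    text = (level or "").lower()
    if any(k in text for k in ("phd", "ph.d", "doctorate", "professional degree")):
        return "PhD"
    if "master" in text:
        return "Master's"
    if "bachelor" in text:
        return "Bachelor's"
    # GED, Associate, "No Education Listed", and any unmatched label land here.
    return "Associate's or lower"

labels = ["Bachelor's degree", "Master's degree", "Ph.D. or professional degree", "GED"]
groups = [education_group(label) for label in labels]
```

Checking the PhD keywords first keeps a combined label such as "Ph.D. or professional degree" out of the lower groups.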

See Appendix 2: I asked AI to help fix the data points falling in a straight line, and it suggested adding jitter.
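
Jitter spreads out points that share the same whole-number x value, such as years of experience, by adding a small random offset to each one. A minimal sketch using only the standard library follows. The width of 0.3, the seed, and the function name are illustrative assumptions, not the values used in the assignment:

```python
import random

def jitter(values, width=0.3, seed=42):
    """Add uniform noise in [-width, +width] to each value so overlapping points spread out."""
    rng = random.Random(seed)  # fixed seed keeps the plot reproducible
    return [v + rng.uniform(-width, width) for v in values]

# Hypothetical years-of-experience values with many ties.
years = [2, 2, 2, 5, 5, 10]
jittered = jitter(years)
```

Keeping the width well under 0.5 means each point stays closer to its true value than to the neighboring whole number.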